Mining the Web for Bilingual Text
Author
Abstract
STRAND (Resnik) is a language-independent system for automatic discovery of text in parallel translation on the World Wide Web. This paper extends the preliminary STRAND results by adding automatic language identification, scaling up by orders of magnitude, and formally evaluating performance. The most recent end product is an automatically acquired parallel corpus of English-French document pairs, on the order of a million words per language.
Similar resources
Mining Bilingual Data from the Web with Adaptively Learnt Patterns
Mining bilingual data (including bilingual sentences and terms) from the Web can benefit many NLP applications, such as machine translation and cross-language information retrieval. In this paper, based on the observation that bilingual data in many web pages appear collectively following similar patterns, an adaptive pattern-based bilingual data mining method is proposed. Specifically, give...
Parallel Sentences Mining From The Web
Parallel sentences can benefit many NLP applications (e.g., machine translation, cross-language information retrieval). In this paper, candidate bilingual web pages are returned by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm for extracting candidate bilingual resources and filtering out useless bilingual web pages. The pair sentences includ...
Designing a System for Trend Analysis of Users in Website Surfing in Iran Using Data Mining and Text Mining Algorithms
Background and Aim: Because web surfing has entered the lifestyle of a large majority of people in society, and more accurate social and cultural policy making is needed in this field, the authors set out to analyze how users view different websites so as to help politicians and practitioners. Methods: Design science research method is used in this research...
A Scalable Approach to Building a Parallel Corpus from the Web
Parallel text acquisition from the Web is an attractive way for augmenting statistical models (e.g., machine translation, crosslingual document retrieval) with domain representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy o...
Literature Review: Mining the Web for Parallel Text: The STRAND System
This paper presents a short review of mining the web for parallel texts with an emphasis on the STRAND system. In Section 2 we start by trying to broadly define what is meant by the word corpus. After that, in Section 3 we give an overview of the World Wide Web as a source for collecting corpora, followed (in Section 4) by a discussion on related copyright issues. We then review some articles t...
LiveTrans: Cross-Language Web Search through Live Mining of Query Translations
Enabling users to automatically find effective translations for query terms not included in a dictionary is one of the major goals of a practical cross-language Web search service. This paper presents a cross-language Web search system called LiveTrans, which is an experimental metasearch engine that provides English-Chinese cross-lingual retrieval of both Web pages and images. The system has bee...